This project builds a predictive model that estimates the number of sample cases of wine purchased by wine distribution companies after sampling a wine. These cases are used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases are purchased, the more likely a wine is to be sold at a high-end restaurant. After an initial variable inspection, data imputation, and transformation, three types of regression models were prepared and compared on test data: Poisson regression, negative binomial regression, and multiple linear regression. Based on regression performance metrics, the best model was selected and applied to the evaluation dataset.
The training dataset contains 12795 observations of 16 variables (one index, one response, and 14 predictor variables).
Each record (row) describes a wine being sold through a range of parameters, such as its chemical properties. The response variable TARGET is a count: the number of cases of wine that are sold as tasting samples to restaurants and wine stores around the United States.
The variables are:
Summaries for the individual variables are provided below.
| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA's |
|---|---|---|---|---|---|---|---|
| INDEX | 1 | 4038 | 8110 | 8070 | 12106 | 16129 | |
| TARGET | 0.000 | 2.000 | 3.000 | 3.029 | 4.000 | 8.000 | |
| FixedAcidity | -18.100 | 5.200 | 6.900 | 7.076 | 9.500 | 34.400 | |
| VolatileAcidity | -2.7900 | 0.1300 | 0.2800 | 0.3241 | 0.6400 | 3.6800 | |
| CitricAcid | -3.2400 | 0.0300 | 0.3100 | 0.3084 | 0.5800 | 3.8600 | |
| ResidualSugar | -127.800 | -2.000 | 3.900 | 5.419 | 15.900 | 141.150 | 616 |
| Chlorides | -1.1710 | -0.0310 | 0.0460 | 0.0548 | 0.1530 | 1.3510 | 638 |
| FreeSulfurDioxide | -555.00 | 0.00 | 30.00 | 30.85 | 70.00 | 623.00 | 647 |
| TotalSulfurDioxide | -823.0 | 27.0 | 123.0 | 120.7 | 208.0 | 1057.0 | 682 |
| Density | 0.8881 | 0.9877 | 0.9945 | 0.9942 | 1.0005 | 1.0992 | |
| pH | 0.480 | 2.960 | 3.200 | 3.208 | 3.470 | 6.130 | 395 |
| Sulphates | -3.1300 | 0.2800 | 0.5000 | 0.5271 | 0.8600 | 4.2400 | 1210 |
| Alcohol | -4.70 | 9.00 | 10.40 | 10.49 | 12.40 | 26.50 | 653 |
| LabelAppeal | -2.000000 | -1.000000 | 0.000000 | -0.009066 | 1.000000 | 2.000000 | |
| AcidIndex | 4.000 | 7.000 | 8.000 | 7.773 | 8.000 | 17.000 | |
| STARS | 1.000 | 1.000 | 2.000 | 2.042 | 3.000 | 4.000 | 3359 |
From the summaries and the chart above we can see that all variables are numeric and that several variables have missing data. The share of NAs is below 10% for every variable except STARS, where 3359 of the 12795 values (about 26%) are missing.
A check for near-zero variance did not flag any variable.
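A near-zero-variance screen of this kind can be sketched in base R as follows. The report does not say which tool or thresholds were used, so the helper `near_zero_var`, its cut-offs (frequency ratio above 95/5 and fewer than 10% unique values), and the toy data are all assumptions made for illustration.

```r
# Hypothetical helper: flag columns whose values are almost constant.
near_zero_var <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  flags <- vapply(df, function(x) {
    x <- x[!is.na(x)]
    counts <- sort(table(x), decreasing = TRUE)
    # Ratio of the most common value to the second most common value
    freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
    # Share of distinct values among all observations, in percent
    pct_unique <- 100 * length(counts) / length(x)
    freq_ratio > freq_cut && pct_unique < unique_cut
  }, logical(1))
  names(df)[flags]
}

set.seed(1)
toy <- data.frame(
  almost_constant = c(rep(0, 990), rep(1, 10)),  # 99:1 split, should be flagged
  varied          = rnorm(1000)                  # many distinct values
)
flagged <- near_zero_var(toy)
print(flagged)
```

A column is flagged only when both conditions hold, so a discrete but well-spread variable such as a rating on a short scale would not be flagged by this sketch.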
Per-variable distribution analysis is provided below (excluding the INDEX variable, which is immaterial to the analysis and is not considered further).
Summary of the findings from the univariate analysis:
The pairwise correlations between the continuous variables are displayed below.
The pairwise scatterplots of the most highly correlated variables vs. the response are provided below.
Summary of the findings from the correlation analysis:
- Only a few predictors show a notable correlation with TARGET, namely: STARS, AcidIndex, LabelAppeal, VolatileAcidity, and Alcohol.
- There is no notable correlation between the predictors (except between STARS and LabelAppeal (r=+0.33), where it is not strong enough to cause concern).

The exploratory analysis shows that the distributions of many predictors are suspiciously symmetrical around zero. Moreover, these predictor variables cannot physically be negative (e.g. a wine cannot have a negative alcohol or chloride content).
Therefore, we shall test if the negative sign represents an error in the data entry and can be ignored without losing the correlations in the dataset.
A correlation plot below shows the pairwise correlations in the data where the absolute values were taken for the columns: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Sulphates, Alcohol.
Comparing this with the previous chart, we can see that the pairwise correlation coefficients have hardly changed in magnitude and have not changed direction, which supports the idea that the minus sign is a data entry error.
As an example, compare the scatterplot of Alcohol vs the TARGET for raw and absolute values (a random sample of 5000 observations is displayed):
We can clearly see that removing the minus sign simply moves the negative Alcohol values to the positive side of the X axis and does not affect the relationship with the response variable.
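The mechanics of this check can be sketched as below. The data are simulated purely for illustration (a sign flip is injected into about 10% of the records), and the names `toy`, `abs_cols`, and `toy_abs` are hypothetical; the sketch does not reproduce the correlations reported for the wine data.

```r
set.seed(42)
n <- 1000
alcohol_true <- rnorm(n, mean = 10.5, sd = 2)         # simulated alcohol content
target <- rpois(n, lambda = exp(0.1 * alcohol_true))  # simulated count response

# Simulate a data-entry error: flip the sign on roughly 10% of the records
sign_flip <- ifelse(runif(n) < 0.1, -1, 1)
toy <- data.frame(Alcohol = alcohol_true * sign_flip, TARGET = target)

# Columns assumed to carry the sign error; replace them by absolute values
abs_cols <- c("Alcohol")
toy_abs <- toy
toy_abs[abs_cols] <- lapply(toy_abs[abs_cols], abs)

# Compare the correlation with the response before and after the correction
cor_raw <- cor(toy$Alcohol, toy$TARGET)
cor_abs <- cor(toy_abs$Alcohol, toy_abs$TARGET)
print(c(raw = cor_raw, absolute = cor_abs))
```

For the full dataset the same comparison would be run on the whole correlation matrix, with `abs_cols` set to the nine columns listed above.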
As shown at the beginning of the exploratory analysis, several variables have missing data. The missing values are imputed using the functions of the mice package.
In the chart below, the density of the imputed data for each variable is shown in magenta, while the density of the observed data is shown in blue. We can see that the imputation does not result in a very different distribution, so we can proceed with the imputed dataset.
##
## iter imp variable
## 1 1 pH ResidualSugar Chlorides FreeSulfurDioxide Alcohol TotalSulfurDioxide Sulphates STARS
## 2 1 pH ResidualSugar Chlorides FreeSulfurDioxide Alcohol TotalSulfurDioxide Sulphates STARS
## 3 1 pH ResidualSugar Chlorides FreeSulfurDioxide Alcohol TotalSulfurDioxide Sulphates STARS
## 4 1 pH ResidualSugar Chlorides FreeSulfurDioxide Alcohol TotalSulfurDioxide Sulphates STARS
## 5 1 pH ResidualSugar Chlorides FreeSulfurDioxide Alcohol TotalSulfurDioxide Sulphates STARS
## 6 1 pH ResidualSugar Chlorides FreeSulfurDioxide Alcohol TotalSulfurDioxide Sulphates STARS
## 7 1 pH ResidualSugar Chlorides FreeSulfurDioxide Alcohol TotalSulfurDioxide Sulphates STARS
## 8 1 pH ResidualSugar Chlorides FreeSulfurDioxide Alcohol TotalSulfurDioxide Sulphates STARS
## 9 1 pH ResidualSugar Chlorides FreeSulfurDioxide Alcohol TotalSulfurDioxide Sulphates STARS
## 10 1 pH ResidualSugar Chlorides FreeSulfurDioxide Alcohol TotalSulfurDioxide Sulphates STARS
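A minimal sketch of such an imputation run is shown below. It assumes the mice package is installed and uses the built-in `airquality` dataset as a stand-in for the wine data. The settings `m = 1` and `maxit = 10` are read off the log above (one imputation, ten iterations); the seed and the default imputation method are assumptions, since the report does not state them.

```r
# Sketch only: runs when the 'mice' package is available.
if (requireNamespace("mice", quietly = TRUE)) {
  # One imputed dataset, ten iterations, no iteration log printed
  imp <- mice::mice(airquality, m = 1, maxit = 10, seed = 123, printFlag = FALSE)

  # Extract the completed dataset (observed values plus imputed values)
  df_imputed <- mice::complete(imp, 1)

  # Number of missing values left after imputation
  remaining_na <- sum(is.na(df_imputed))
  print(remaining_na)
}
```

The completed data frame has the same shape as the input, so it can replace the original data in the subsequent modelling steps.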
From the density plots below we can also see that after the transformation and imputation several predictors do not show a normal distribution.
After iterating on power transformations, the following predictors were transformed using the Box-Cox method: FixedAcidity, VolatileAcidity, CitricAcid, TotalSulfurDioxide, Sulphates.
The \(\lambda\) values applied are shown in the table below.
| FixedAcidity | VolatileAcidity | CitricAcid | TotalSulfurDioxide | Sulphates |
|---|---|---|---|---|
| 0.9616419 | 1.110294 | 0.9378151 | 0.9779128 | 1.002414 |
Plotting the distributions of the Box-Cox transformed variables, we can see that their distributions have become closer to normal.
This has resulted in improved correlation with the response for the transformed variables, as displayed in the plot below.
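For reference, the Box-Cox power transform itself can be written in a few lines of base R. The helper `box_cox` and the input vector are illustrative assumptions; only the \(\lambda\) values are taken from the table above, and the report does not show which function was used to apply them.

```r
# Box-Cox power transform for strictly positive x:
#   (x^lambda - 1) / lambda  if lambda != 0,  log(x) if lambda == 0
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0, na.rm = TRUE))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Lambda values copied from the table above
lambdas <- c(FixedAcidity = 0.9616419, VolatileAcidity = 1.110294,
             CitricAcid = 0.9378151, TotalSulfurDioxide = 0.9779128,
             Sulphates = 1.002414)

x <- c(0.5, 1, 2, 4)                         # made-up positive values
transformed <- box_cox(x, lambdas[["Sulphates"]])
print(transformed)
```

Note that \(\lambda = 1\) reduces the transform to x - 1, a pure shift, and \(\lambda = 0\) gives the natural logarithm.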
In this step, multiple regression models are built to predict the TARGET count of wine cases ordered.
The models are built on an 80% sample of the training data, and the remaining 20% is used to assess model performance on out-of-sample data, in order to avoid choosing an overfitted model as the best one. The models are compared with each other using RMSE as the metric of prediction quality.
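The split and the metric can be sketched as follows. The simulated data, the seed, and the helper `rmse` are assumptions for illustration; `df_train` follows the name visible in the model calls, while `df_test` is a hypothetical name for the hold-out part.

```r
set.seed(2024)                               # arbitrary seed for reproducibility
toy <- data.frame(x = rnorm(500))
toy$TARGET <- rpois(500, lambda = exp(1 + 0.3 * toy$x))

# 80/20 split by random row indices
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.8 * nrow(toy)))
df_train <- toy[train_idx, ]
df_test  <- toy[-train_idx, ]

# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

fit <- glm(TARGET ~ x, family = poisson, data = df_train)
# type = "response" returns predictions on the count scale, not the log scale
test_rmse <- rmse(df_test$TARGET,
                  predict(fit, newdata = df_test, type = "response"))
print(test_rmse)
```

For the log-link models it matters that predictions are taken with `type = "response"`; otherwise RMSE would be computed against values on the log scale.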
Poisson regression models count data. We build two models: the full model, and a model with the following set of predictors that showed at least moderate correlation with the response: LabelAppeal, STARS, AcidIndex, VolatileAcidity, TotalSulfurDioxide, Alcohol. This subset covers the bottle appeal, the taste rating, and some of the main chemical properties of a wine.
The model in-sample performance is provided below.
Model summary and performance
##
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = df_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8628 -0.5105 0.2128 0.6343 2.6883
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.108e+00 2.271e-01 9.286 < 2e-16 ***
## Density -3.406e-01 2.174e-01 -1.566 0.117308
## LabelAppeal 2.148e-01 6.628e-03 32.413 < 2e-16 ***
## AcidIndex -1.263e-01 4.993e-03 -25.301 < 2e-16 ***
## pH -1.815e-02 8.358e-03 -2.171 0.029921 *
## ResidualSugar -1.731e-05 2.279e-04 -0.076 0.939455
## Chlorides -3.259e-02 2.458e-02 -1.326 0.184898
## FreeSulfurDioxide 1.206e-04 5.207e-05 2.316 0.020557 *
## Alcohol 6.101e-03 1.566e-03 3.896 9.79e-05 ***
## STARS 1.691e-01 6.354e-03 26.618 < 2e-16 ***
## FixedAcidity -1.998e-04 6.142e-03 -0.033 0.974050
## VolatileAcidity -2.767e-01 3.766e-02 -7.346 2.04e-13 ***
## CitricAcid 1.362e-01 2.360e-02 5.772 7.82e-09 ***
## TotalSulfurDioxide 2.759e-02 4.951e-03 5.573 2.51e-08 ***
## Sulphates -1.017e-01 2.847e-02 -3.574 0.000352 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18338 on 10237 degrees of freedom
## Residual deviance: 14956 on 10223 degrees of freedom
## AIC: 40522
##
## Number of Fisher Scoring iterations: 5
| Model | Explained.Variance | RMSE |
|---|---|---|
| poisson_full | 0.1843865 | 1.652511 |
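The Explained.Variance figure appears to be a deviance-based measure, i.e. one minus the ratio of residual to null deviance. This is an inference from the numbers rather than something the report states; the sketch below recomputes it from the rounded deviances printed in the summary, so it agrees with the table only to about three decimal places.

```r
# Deviances as printed in the model summary above
null_deviance     <- 18338
residual_deviance <- 14956

# Share of the null deviance explained by the model
explained <- 1 - residual_deviance / null_deviance
print(round(explained, 4))
```

For a fitted `glm` object named, say, `fit`, the same quantity would be `1 - fit$deviance / fit$null.deviance`.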
From the model summary and the diagnostic plot we can see the following:
1) The errors are not quite normally distributed
2) Several variables are not significant
Interpretation of the regression coefficients for the significant variables
The model coefficients (only statistically significant ones) ranked by descending magnitude are provided in the table below.
| Variable | Coefficient | ConfLevel |
|---|---|---|
| (Intercept) | 2.10831 | 1.00000 |
| VolatileAcidity | -0.27667 | 1.00000 |
| LabelAppeal | 0.21482 | 1.00000 |
| STARS | 0.16914 | 1.00000 |
| CitricAcid | 0.13621 | 1.00000 |
| AcidIndex | -0.12632 | 1.00000 |
| Sulphates | -0.10174 | 0.99965 |
| TotalSulfurDioxide | 0.02759 | 1.00000 |
| pH | -0.01815 | 0.97008 |
| Alcohol | 0.00610 | 0.99990 |
| FreeSulfurDioxide | 0.00012 | 0.97944 |
We can interpret the model coefficients as follows:
- VolatileAcidity, AcidIndex, and Sulphates have the strongest negative impact on the response variable. All of these variables describe taste parameters of a wine. The interpretation of their negative impact is that high concentrations of individual components are detrimental to the overall taste. Chlorides also has a negative coefficient, but it is not statistically significant.
- LabelAppeal, STARS, and CitricAcid have a positive impact on the number of ordered cases. This confirms the idea that bottle design and expert rating positively impact wine sales. Citric acid in small quantities can add citrus notes to wine taste, and is sometimes added to enhance the taste. This may explain the positive effect this variable has on the number of orders.
- While statistically significant, the Alcohol content of a wine has no practically relevant effect on the number of ordered cases.
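Because the Poisson model uses a log link, the size of each effect is easier to read after exponentiating the coefficient. The short sketch below does this for a few coefficients copied from the table above; the variable names `coefs`, `rate_ratios`, and `pct_change` are illustrative.

```r
# Coefficients copied from the table above (log scale)
coefs <- c(VolatileAcidity = -0.27667, LabelAppeal = 0.21482,
           STARS = 0.16914, Alcohol = 0.00610)

# exp(coefficient) is the multiplicative change in the expected count
# for a one-unit increase in the predictor, other predictors held fixed
rate_ratios <- exp(coefs)
pct_change  <- round(100 * (rate_ratios - 1), 1)
print(pct_change)
```

For example, one additional unit of LabelAppeal corresponds to roughly a 24% higher expected count, while one additional unit of Alcohol changes it by well under 1%. For the Box-Cox transformed predictors, a unit refers to the transformed scale.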
For the second model, only the following variables are considered:
LabelAppeal, STARS, AcidIndex, VolatileAcidity, TotalSulfurDioxide, Alcohol.
Model summary and performance
##
## Call:
## glm(formula = TARGET ~ LabelAppeal + STARS + AcidIndex + VolatileAcidity +
## TotalSulfurDioxide + Alcohol, family = poisson, data = df_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.7228 -0.5011 0.2192 0.6330 2.7294
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.731607 0.060232 28.749 < 2e-16 ***
## LabelAppeal 0.215639 0.006620 32.574 < 2e-16 ***
## STARS 0.169595 0.006351 26.705 < 2e-16 ***
## AcidIndex -0.124833 0.004909 -25.427 < 2e-16 ***
## VolatileAcidity -0.288878 0.037617 -7.679 1.6e-14 ***
## TotalSulfurDioxide 0.028403 0.004944 5.745 9.2e-09 ***
## Alcohol 0.006078 0.001567 3.879 0.000105 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18338 on 10237 degrees of freedom
## Residual deviance: 15017 on 10231 degrees of freedom
## AIC: 40567
##
## Number of Fisher Scoring iterations: 5
| Model | Explained.Variance | RMSE |
|---|---|---|
| poisson_reduced | 0.1810526 | 1.657139 |
Looking at the model summary and the in-sample performance, we can see that all coefficients are now highly significant, and the explained variance is reduced only slightly (from 18.44% in the full model to 18.11% in the reduced model), while RMSE stayed virtually the same (1.653 vs. 1.657).
Interpretation of the regression coefficients for the significant variables
| Variable | Coefficient | ConfLevel |
|---|---|---|
| (Intercept) | 1.73161 | 1.0000 |
| VolatileAcidity | -0.28888 | 1.0000 |
| LabelAppeal | 0.21564 | 1.0000 |
| STARS | 0.16960 | 1.0000 |
| AcidIndex | -0.12483 | 1.0000 |
| TotalSulfurDioxide | 0.02840 | 1.0000 |
| Alcohol | 0.00608 | 0.9999 |
The interpretation of the coefficients, their direction and magnitude have stayed the same as in the full model.
Negative binomial regression can be used for over-dispersed count data, that is when the conditional variance exceeds the conditional mean. It can be considered as a generalization of Poisson regression since it has the same mean structure as Poisson regression and it has an extra parameter to model the over-dispersion.[1]
The difference from the Poisson models built above would be in the confidence intervals for the regression coefficients.
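Whether over-dispersion is present can be checked on a fitted Poisson model with the Pearson dispersion statistic. The sketch below uses simulated, truly Poisson data, so the statistic should land close to 1; the data and variable names are assumptions for illustration only.

```r
set.seed(7)
n <- 2000
x <- rnorm(n)
y <- rpois(n, lambda = exp(1 + 0.2 * x))     # simulated, truly Poisson counts
fit <- glm(y ~ x, family = poisson)

# Pearson dispersion statistic: values well above 1 suggest over-dispersion
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(dispersion)
```

As a rough analogue from the output above, the ratio of residual deviance to residual degrees of freedom for the full Poisson model is 14956 / 10223, or about 1.46.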
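The first negative binomial model considered is a reduced version of the full model in which the following variables are excluded: Density, ResidualSugar, FreeSulfurDioxide, and FixedAcidity. Of these, Density, ResidualSugar, and FixedAcidity were not significant in the full Poisson model, while FreeSulfurDioxide was significant only at the 5% level and had a negligible coefficient (1.2e-04). Chlorides, which was also not significant, is retained.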
The model in-sample performance is provided below.
Model summary and performance
##
## Call:
## glm.nb(formula = TARGET ~ . - Density - ResidualSugar - FreeSulfurDioxide -
## FixedAcidity, data = df_train, init.theta = 35493.57398,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8956 -0.5134 0.2156 0.6359 2.7251
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.786339 0.073538 24.292 < 2e-16 ***
## LabelAppeal 0.215126 0.006625 32.471 < 2e-16 ***
## AcidIndex -0.126740 0.004934 -25.688 < 2e-16 ***
## pH -0.018327 0.008359 -2.192 0.028353 *
## Chlorides -0.032543 0.024582 -1.324 0.185563
## Alcohol 0.006091 0.001566 3.889 0.000100 ***
## STARS 0.169015 0.006352 26.608 < 2e-16 ***
## VolatileAcidity -0.275714 0.037651 -7.323 2.43e-13 ***
## CitricAcid 0.136350 0.023593 5.779 7.50e-09 ***
## TotalSulfurDioxide 0.027616 0.004950 5.579 2.41e-08 ***
## Sulphates -0.103293 0.028462 -3.629 0.000284 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(35493.57) family taken to be 1)
##
## Null deviance: 18336 on 10237 degrees of freedom
## Residual deviance: 14963 on 10227 degrees of freedom
## AIC: 40524
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 35494
## Std. Err.: 66351
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -40499.74
| Model | Explained.Variance | RMSE |
|---|---|---|
| nb_1 | 0.1839566 | 1.653338 |
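As expected, the coefficients estimated by the model are close to the ones estimated by the full Poisson model, and all of them except Chlorides are statistically significant. Model performance is also close to that of the full Poisson model. The very large estimated theta (35494, with the iteration limit reached while fitting it) indicates that the model found practically no over-dispersion beyond the Poisson assumption, which explains why the two fits nearly coincide.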
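The role of theta can be illustrated with the variance function of the negative binomial family as parameterised by `MASS::glm.nb`. The helper `nb_variance` is hypothetical; the mean of TARGET and the value of theta are taken from the outputs above.

```r
# Negative binomial variance in the MASS::glm.nb parameterisation:
#   Var(Y) = mu + mu^2 / theta
nb_variance <- function(mu, theta) mu + mu^2 / theta

mu    <- 3.029       # mean of TARGET from the data summary
theta <- 35493.57    # theta reported in the model output above

# Part of the variance that goes beyond the Poisson variance (which equals mu)
extra_variance <- nb_variance(mu, theta) - mu
print(extra_variance)
```

With a theta this large the extra term is negligible relative to the mean, so the negative binomial model behaves almost exactly like a Poisson model.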
Interpretation of the regression coefficients for the significant variables
The model coefficients (only statistically significant ones) ranked by descending magnitude are provided in the table below.
| Variable | Coefficient | ConfLevel |
|---|---|---|
| (Intercept) | 1.78634 | 1.00000 |
| VolatileAcidity | -0.27571 | 1.00000 |
| LabelAppeal | 0.21513 | 1.00000 |
| STARS | 0.16901 | 1.00000 |
| CitricAcid | 0.13635 | 1.00000 |
| AcidIndex | -0.12674 | 1.00000 |
| Sulphates | -0.10329 | 0.99972 |
| TotalSulfurDioxide | 0.02762 | 1.00000 |
| pH | -0.01833 | 0.97165 |
| Alcohol | 0.00609 | 0.99990 |
We can see that the coefficients for the shared variables between this reduced negative binomial and the full poisson model are only slightly different, which is caused by the absence of several predictors in this model vs. the full model.
The interpretation of the coefficients, their direction and magnitude have stayed the same as in the full model.
The second negative binomial model considers only the two predictors that are measurable without a chemical analysis of a wine: STARS and LabelAppeal. The goal of building this model is to test it against the first negative binomial model to assess whether the chemical composition predictors play an important role.
The model in-sample performance is provided below.
Model summary and performance
##
## Call:
## glm.nb(formula = TARGET ~ STARS + LabelAppeal, data = df_train,
## init.theta = 22381.46543, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5653 -0.4963 0.2734 0.6965 2.4124
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.691926 0.014937 46.32 <2e-16 ***
## STARS 0.184826 0.006291 29.38 <2e-16 ***
## LabelAppeal 0.209143 0.006605 31.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(22381.47) family taken to be 1)
##
## Null deviance: 18336 on 10237 degrees of freedom
## Residual deviance: 15867 on 10235 degrees of freedom
## AIC: 41412
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 22381
## Std. Err.: 66815
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -41403.93
| Model | Explained.Variance | RMSE |
|---|---|---|
| nb_2 | 0.1346434 | 1.718995 |
We can see a reduction in the explained variance in this model vs. the first one (from 18.4% to 13.5%), and an increase in RMSE (from 1.653 to 1.719). In order to confirm the significance of the chemical variables to the model, we compare the two models using a Likelihood Ratio Test (as the second negative binomial model is nested in the first one) [1,2].
## Likelihood ratio tests of Negative Binomial Models
##
## Response: TARGET
## Model
## 1 STARS + LabelAppeal
## 2 (Density + LabelAppeal + AcidIndex + pH + ResidualSugar + Chlorides + FreeSulfurDioxide + Alcohol + STARS + FixedAcidity + VolatileAcidity + CitricAcid + TotalSulfurDioxide + Sulphates) - Density - ResidualSugar - FreeSulfurDioxide - FixedAcidity
## theta Resid. df 2 x log-lik. Test df LR stat. Pr(Chi)
## 1 22381.47 10235 -41403.93
## 2 35493.57 10227 -40499.74 1 vs 2 8 904.1972 0
From the output we can see that the Likelihood Ratio statistic for the model with more predictors is very significantly different from zero (LR stat. = 904.20 on 8 degrees of freedom, with the p-value reported as 0). This means that the chemical variables carry relevant information for the prediction of the response.
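The test statistic can be reproduced by hand from the values in the output above: it is the difference between the two "2 x log-lik." figures, compared against a chi-squared distribution with 8 degrees of freedom. The variable names in this sketch are illustrative.

```r
# Values copied from the likelihood ratio test output above
twice_loglik_small <- -41403.93   # model with STARS + LabelAppeal only
twice_loglik_large <- -40499.74   # first negative binomial model
df_diff <- 8                      # difference in the number of coefficients

# Likelihood ratio statistic and its upper-tail chi-squared probability
lr_stat <- twice_loglik_large - twice_loglik_small
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
print(c(LR = lr_stat, p = p_value))
```

The p-value is far too small to be displayed at the printed precision, which is why the output above shows it as 0.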
Interpretation of the regression coefficients for the significant variables
The model coefficients (only statistically significant ones) ranked by descending magnitude are provided in the table below.
| Variable | Coefficient | ConfLevel |
|---|---|---|
| (Intercept) | 0.69193 | 1 |
| LabelAppeal | 0.20914 | 1 |
| STARS | 0.18483 | 1 |
Interestingly, the coefficients for the STARS and LabelAppeal variables are only slightly different from those in the first negative binomial model.
The interpretation of the coefficients, their direction and magnitude have stayed the same as in the full poisson model.
In order to test whether the Poisson model (log link function) is really the best choice for modeling the relationship between the predictors and the response, we will test two linear models: a full model, and a model that is automatically selected using a backward stepwise approach.
The model in-sample performance is provided below.
Model summary and performance
##
## Call:
## lm(formula = TARGET ~ ., data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9539 -0.7296 0.3857 1.1241 4.9667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.6181135 0.6466711 8.688 < 2e-16 ***
## Density -0.9105715 0.6203816 -1.468 0.142200
## LabelAppeal 0.6531062 0.0189284 34.504 < 2e-16 ***
## AcidIndex -0.3359182 0.0127028 -26.444 < 2e-16 ***
## pH -0.0463445 0.0239363 -1.936 0.052876 .
## ResidualSugar -0.0002924 0.0006533 -0.448 0.654434
## Chlorides -0.1064593 0.0700150 -1.521 0.128411
## FreeSulfurDioxide 0.0003354 0.0001502 2.234 0.025536 *
## Alcohol 0.0202356 0.0044948 4.502 6.81e-06 ***
## STARS 0.5454545 0.0188179 28.986 < 2e-16 ***
## FixedAcidity -0.0000681 0.0175237 -0.004 0.996899
## VolatileAcidity -0.8091070 0.1083263 -7.469 8.73e-14 ***
## CitricAcid 0.4151552 0.0677770 6.125 9.38e-10 ***
## TotalSulfurDioxide 0.0796650 0.0140483 5.671 1.46e-08 ***
## Sulphates -0.2967962 0.0819140 -3.623 0.000292 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.65 on 10223 degrees of freedom
## Multiple R-squared: 0.2682, Adjusted R-squared: 0.2672
## F-statistic: 267.6 on 14 and 10223 DF, p-value: < 2.2e-16
From the model summary we can see that almost the same variables are not significant for the linear model as for the poisson model.
| Model | Adj..R.Squared | RMSE |
|---|---|---|
| lm_full | 0.2671715 | 1.648819 |
We can see that the Adjusted R-Squared metric (26.7%) appears higher than the explained deviance of the Poisson/negative binomial models, although the two metrics are not directly comparable. The in-sample RMSE of the linear model (1.649) is marginally lower than that of the full Poisson model (1.653).
Interpretation of the regression coefficients for the significant variables
The model coefficients (only statistically significant ones) ranked by descending magnitude are provided in the table below.
| Variable | Coefficient | ConfLevel |
|---|---|---|
| (Intercept) | 5.61811 | 1.00000 |
| VolatileAcidity | -0.80911 | 1.00000 |
| LabelAppeal | 0.65311 | 1.00000 |
| STARS | 0.54545 | 1.00000 |
| CitricAcid | 0.41516 | 1.00000 |
| AcidIndex | -0.33592 | 1.00000 |
| Sulphates | -0.29680 | 0.99971 |
| TotalSulfurDioxide | 0.07966 | 1.00000 |
| Alcohol | 0.02024 | 0.99999 |
| FreeSulfurDioxide | 0.00034 | 0.97446 |
The coefficients provided by the linear model are different, as they relate the predictors directly to the value of TARGET rather than via an exponent, as in the Poisson model. However, the order of importance, the direction, and the relative magnitudes of the coefficients lead to the same conclusions as in the full Poisson model.
The second linear model is built using an automated stepwise model selection approach in which all non-significant predictors are eliminated. The eliminated predictors are shown in the table below.
| Variable | Relevant.Predictor |
|---|---|
| Density | FALSE |
| ResidualSugar | FALSE |
| Chlorides | FALSE |
| FixedAcidity | FALSE |
| (Intercept) | TRUE |
| LabelAppeal | TRUE |
| AcidIndex | TRUE |
| pH | TRUE |
| FreeSulfurDioxide | TRUE |
| Alcohol | TRUE |
| STARS | TRUE |
| VolatileAcidity | TRUE |
| CitricAcid | TRUE |
| TotalSulfurDioxide | TRUE |
| Sulphates | TRUE |
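The linear model fitted below excludes the following predictors: Density, ResidualSugar, FreeSulfurDioxide, and FixedAcidity (the same set as in the first negative binomial model). Note that this set differs from the table above: the fitted model retains Chlorides and drops FreeSulfurDioxide.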
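A backward stepwise selection can be sketched with the base R function `step()`. The report does not show which function or criterion was used, so this is an assumption: `step()` removes terms based on AIC rather than on p-values, and the simulated data and names below are purely illustrative.

```r
set.seed(11)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
# 'noise' has no true effect on the response
toy$y <- 2 + 1.5 * toy$x1 - 0.8 * toy$x2 + rnorm(n)

lm_full <- lm(y ~ ., data = toy)

# Backward elimination; step() drops terms by AIC, not by p-value
lm_step <- step(lm_full, direction = "backward", trace = 0)

# Predictors retained in the selected model
kept <- attr(terms(lm_step), "term.labels")
print(kept)
```

An AIC-based search can retain a predictor that is not significant at the 5% level, which is one way a selected model can still contain a non-significant term.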
The model in-sample performance is provided below.
Model summary and performance
##
## Call:
## lm(formula = TARGET ~ . - Density - ResidualSugar - FreeSulfurDioxide -
## FixedAcidity, data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9982 -0.7257 0.3845 1.1253 4.9280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.755705 0.204763 23.225 < 2e-16 ***
## LabelAppeal 0.654241 0.018926 34.568 < 2e-16 ***
## AcidIndex -0.337231 0.012504 -26.969 < 2e-16 ***
## pH -0.046994 0.023939 -1.963 0.049664 *
## Chlorides -0.106300 0.070020 -1.518 0.129012
## Alcohol 0.020221 0.004495 4.498 6.92e-06 ***
## STARS 0.544964 0.018813 28.967 < 2e-16 ***
## VolatileAcidity -0.806482 0.108334 -7.444 1.05e-13 ***
## CitricAcid 0.417240 0.067782 6.156 7.76e-10 ***
## TotalSulfurDioxide 0.079655 0.014046 5.671 1.46e-08 ***
## Sulphates -0.301990 0.081899 -3.687 0.000228 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.65 on 10227 degrees of freedom
## Multiple R-squared: 0.2676, Adjusted R-squared: 0.2669
## F-statistic: 373.8 on 10 and 10227 DF, p-value: < 2.2e-16
From the model summary we can see that all remaining predictors except Chlorides are significant at the 5% level.
| Model | Adj..R.Squared | RMSE |
|---|---|---|
| lm_stepwise | 0.2669304 | 1.649413 |
We can see that the Adjusted R-Squared and RMSE metrics have hardly moved vs. the full model, indicating that the excluded predictors did not add explanatory power to the model.
Interpretation of the regression coefficients for the significant variables
The model coefficients (only statistically significant ones) ranked by descending magnitude are provided in the table below.
| Variable | Coefficient | ConfLevel |
|---|---|---|
| (Intercept) | 4.75570 | 1.00000 |
| VolatileAcidity | -0.80648 | 1.00000 |
| LabelAppeal | 0.65424 | 1.00000 |
| STARS | 0.54496 | 1.00000 |
| CitricAcid | 0.41724 | 1.00000 |
| AcidIndex | -0.33723 | 1.00000 |
| Sulphates | -0.30199 | 0.99977 |
| TotalSulfurDioxide | 0.07966 | 1.00000 |
| pH | -0.04699 | 0.95034 |
| Alcohol | 0.02022 | 0.99999 |
The direction and interpretation of the coefficients are the same as in the full linear model.
In this step, the performance of all six models is compared based on RMSE on the out-of-sample data. The best performing model is chosen as the final one.
| model | RMSE |
|---|---|
| Poisson full | 1.660830 |
| Poisson reduced | 1.663281 |
| Neg. binom. 1 (large) | 1.661722 |
| Neg. Binom 2 (2 variables) | 1.730915 |
| Linear full | 1.665180 |
| Linear stepwise reduced | 1.666590 |
We can see that on out-of-sample data, the linear model with the automatically selected set of predictors has performed on par with the full linear and full poisson models.
Comparing the fitted values against test data we can see that the linear model is fairly constant in the errors at each level of the response. However, it fails to capture the highest levels (count of 7 and 8 cases). The poisson model has a visibly higher error on the zero count of the response, but does provide several predictions for the highest levels of demand.
Due to its simplicity in terms of the number of predictors and the interpretation, the reduced linear model is chosen as the best one for generating predictions on the evaluation data set.
However, further tuning of the negative binomial family models (e.g. using a zero-inflated model) could improve the precision of the predictions, especially at the lower end of the distribution of the target variable.
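One simple way to judge whether a zero-inflated model might help is to compare the observed number of zeros with the number a fitted Poisson model expects. The sketch below does this on simulated data in which about 20% of the records are forced to zero; the data, the inflation rate, and the variable names are assumptions for illustration.

```r
set.seed(3)
n <- 2000
x <- rnorm(n)
counts <- rpois(n, lambda = exp(1 + 0.2 * x))
# Simulated excess zeros: about 20% of the records are forced to zero
y <- ifelse(runif(n) < 0.2, 0L, counts)

fit <- glm(y ~ x, family = poisson)

observed_zeros <- sum(y == 0)
# Expected number of zeros if the fitted Poisson model were correct
expected_zeros <- sum(dpois(0, lambda = fitted(fit)))
print(c(observed = observed_zeros, expected = round(expected_zeros, 1)))
```

A large surplus of observed over expected zeros is the pattern a zero-inflated model is designed to capture.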
Predictions on the evaluation dataset
The evaluation dataset is transformed in the same way as the training dataset in order to provide correct predictions. NA values in the evaluation data will cause missing predictions.
Predictions on the evaluation dataset are made using the model lm_stepwise.
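The effect of missing values on the predictions can be illustrated with a small stand-in model. The model `toy_lm`, the simulated training data, and the three-row `eval_data` frame are hypothetical; only the behaviour of `predict()` on rows with missing predictors is the point of the sketch.

```r
set.seed(5)
train <- data.frame(STARS = sample(1:4, 200, replace = TRUE),
                    LabelAppeal = sample(-2:2, 200, replace = TRUE))
train$TARGET <- rpois(200, lambda = exp(0.7 + 0.18 * train$STARS +
                                          0.2 * train$LabelAppeal))

# Stand-in for lm_stepwise, fitted on simulated data
toy_lm <- lm(TARGET ~ STARS + LabelAppeal, data = train)

# Evaluation rows; the second row has a missing STARS value
eval_data <- data.frame(STARS = c(3, NA, 1), LabelAppeal = c(1, 0, -1))
preds <- predict(toy_lm, newdata = eval_data)
print(preds)   # the second entry is NA because STARS is missing for that row
```

Rows with missing predictors therefore need to be imputed before prediction if a value is required for every record.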
The output of the model on the evaluated data is available under the following URL: